import pandas as pd
import altair as alt
import numpy as np
Recap of the Data, Goals, and Task
Overview of the Data:
The dataset I selected from Kaggle contains more than two decades of NBA player data, spanning the 1996-97 through 2022-23 seasons. The Kaggle dataset (download link: https://www.kaggle.com/datasets/justinas/nba-players-data?resource=download) covers players' physical demographics (e.g., height), biographical details (e.g., when they were drafted and which team they played for in a given season), and per-season statistical metrics (points, rebounds, assists, usage percentage, and true shooting percentage). I chose this dataset because I wanted to explore the relationship between a player's usage percentage, a basketball statistic that estimates the share of team offensive plays a player uses while on the court, their scoring efficiency (measured by true shooting percentage), and their overall production (points, rebounds, and assists added together). Specifically, I was interested in whether high-usage players tend to score more, whether those high scorers are also efficient shooters, and whether high usage overlaps with high overall statistical contribution.
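For readers new to the efficiency metric, the sketch below shows the standard true shooting percentage formula. The dataset already provides `ts_pct` precomputed, so this is only illustrative; the function name `compute_true_shooting` and the sample box-score numbers are made up for the example.

```python
def compute_true_shooting(points: float, fga: float, fta: float) -> float:
    """Standard true shooting formula: PTS / (2 * (FGA + 0.44 * FTA)).

    fga = field goal attempts, fta = free throw attempts.
    """
    scoring_attempts = fga + 0.44 * fta
    if scoring_attempts == 0:
        raise ValueError("True shooting is undefined with zero shot attempts.")
    return points / (2 * scoring_attempts)


# Hypothetical stat line: 30 points on 20 field goal attempts and 10 free throws.
example_ts = compute_true_shooting(points=30, fga=20, fta=10)
print(round(example_ts, 3))
```

Because free throws are folded into the denominator, the metric rewards players who score efficiently at the line as well as from the field.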
Goals:
From my background in sports statistics, I know that specific statistical measures can help identify elite players. For instance, a player with low usage but a high true shooting percentage typically fits the profile of an NBA role player. A player with both high usage and a high true shooting percentage should represent an elite NBA star. Finally, a player with high usage paired with a low true shooting percentage would signal an inefficient scorer. Similar logic applies to overall statistical contribution: elite players should show strong scoring, rebounding, and assisting numbers alongside high usage and efficient shooting. My goal was to design visualizations that show these statistical relationships and make it easy for users who are not basketball statistics experts to identify players who fit each category.
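The profile logic above can be sketched as a small function. This is an illustration rather than part of the notebook's pipeline: the cutoffs (22% usage, 58% true shooting) are borrowed from the grouping thresholds used later in this notebook, and both the function name and the "Unclassified" label for low-usage, low-efficiency players are assumptions made for the example.

```python
def player_profile(
    usg_pct: float,
    ts_pct: float,
    high_usage: float = 0.22,    # assumed cutoff for "high usage"
    efficient_ts: float = 0.58,  # assumed cutoff for "high true shooting"
) -> str:
    """Map a usage / true shooting pair to the profiles described in the text."""
    is_high_usage = usg_pct >= high_usage
    is_efficient = ts_pct >= efficient_ts
    if is_high_usage and is_efficient:
        return "Elite star"
    if is_high_usage:
        return "Inefficient scorer"
    if is_efficient:
        return "Role player"
    # The text does not name the low-usage, low-efficiency case.
    return "Unclassified"


print(player_profile(0.30, 0.63))
print(player_profile(0.14, 0.61))
print(player_profile(0.29, 0.51))
```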
Tasks:
Because the dataset contains more than 12,000 rows, I first need to reduce its size so that Altair can process it; by default, Altair raises an error for datasets with more than 5,000 rows. As a result, I will look at the five most recent seasons in the dataset. Filtering the dataset before the visualization process yields about 2,700 rows for analysis and interpretation. I also plan to create categorical groupings for usage percentage, true shooting percentage, and overall statistical contribution to identify what constitutes top-of-the-line statistics in the NBA. These groups will improve readability and make filtering more intuitive, allowing users to quickly drill down into areas of interest. Numeric dropdown filters are impractical because they produce long lists to select from, whereas grouping the data into meaningful tiers provides a more user-friendly experience and helps non-expert users quickly understand what qualifies as an elite performance.
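As a sketch of the tiering idea, `pd.cut` can assign each value to a labeled bin in one vectorized call. This is an alternative to the row-by-row functions used later in this notebook, not the method the notebook actually runs; the sample values are made up, and `right=False` is chosen so that the bin edges behave like the `<` comparisons in those functions.

```python
import numpy as np
import pandas as pd

# Made-up usage rates, including values that sit exactly on the bin edges.
usage = pd.Series([0.10, 0.15, 0.219, 0.22, 0.35], name="usg_pct")

usage_group = pd.cut(
    usage,
    bins=[-np.inf, 0.15, 0.22, np.inf],
    labels=["Low Usage (0-15%)", "Medium Usage (15-22%)", "High Usage (22%+)"],
    right=False,  # intervals are [low, high), so 0.15 falls in the medium tier
)

print(usage_group.astype(str).tolist())
```

The result is an ordered categorical, which also keeps the tiers in a sensible order when they are used in a legend or dropdown.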
To answer the analytical questions that I posed, I will use a histogram and two scatter plots. The histogram will show the distribution of average points per game, with a filter for usage percentage, allowing users to see whether higher-usage players tend to score more points per game on average. The first scatter plot will map a player's usage percentage against their true shooting percentage, with color representing the average points per game they scored during a season. At first, I planned to show the differences in points scored through point size in the scatter plot. However, after feedback from three family members, I shifted to a diverging color scale, which makes scoring differences easier to interpret at a glance and reduces the overcrowding in the chart. The second scatter plot will compare usage percentage to overall statistical contribution, helping identify whether elite players typically have the offense run through them and how productive they are as all-around contributors.
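The question the histogram addresses (do higher-usage players score more?) can also be checked numerically. The sketch below uses a tiny made-up table in place of the real data, with column names chosen to match the ones this notebook creates; it is a quick sanity check, not a replacement for the chart.

```python
import pandas as pd

# Made-up rows standing in for the filtered NBA data.
toy = pd.DataFrame(
    {
        "Usage_Group": [
            "Low Usage (0-15%)", "Low Usage (0-15%)",
            "Medium Usage (15-22%)", "Medium Usage (15-22%)",
            "High Usage (22%+)", "High Usage (22%+)",
        ],
        "pts": [3.0, 5.0, 8.0, 12.0, 20.0, 24.0],
    }
)

# Player count and typical points per game within each usage tier.
summary = toy.groupby("Usage_Group")["pts"].agg(["count", "mean", "median"])
print(summary.sort_values("mean"))
```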
The primary goal of the visualizations is to make it easy for users to identify the NBA's elite players and what exactly makes them elite. While sports analysts and coaches could use the visualizations to compare NBA players, a casual fan could also use them to identify elite players and understand how usage, scoring efficiency, and overall contribution interact.
NBA_DS = pd.read_csv('all_seasons.csv')
NBA_DS.head()
| | Unnamed: 0 | player_name | team_abbreviation | age | player_height | player_weight | college | country | draft_year | draft_round | ... | pts | reb | ast | net_rating | oreb_pct | dreb_pct | usg_pct | ts_pct | ast_pct | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Randy Livingston | HOU | 22.0 | 193.04 | 94.800728 | Louisiana State | USA | 1996 | 2 | ... | 3.9 | 1.5 | 2.4 | 0.3 | 0.042 | 0.071 | 0.169 | 0.487 | 0.248 | 1996-97 |
| 1 | 1 | Gaylon Nickerson | WAS | 28.0 | 190.50 | 86.182480 | Northwestern Oklahoma | USA | 1994 | 2 | ... | 3.8 | 1.3 | 0.3 | 8.9 | 0.030 | 0.111 | 0.174 | 0.497 | 0.043 | 1996-97 |
| 2 | 2 | George Lynch | VAN | 26.0 | 203.20 | 103.418976 | North Carolina | USA | 1993 | 1 | ... | 8.3 | 6.4 | 1.9 | -8.2 | 0.106 | 0.185 | 0.175 | 0.512 | 0.125 | 1996-97 |
| 3 | 3 | George McCloud | LAL | 30.0 | 203.20 | 102.058200 | Florida State | USA | 1989 | 1 | ... | 10.2 | 2.8 | 1.7 | -2.7 | 0.027 | 0.111 | 0.206 | 0.527 | 0.125 | 1996-97 |
| 4 | 4 | George Zidek | DEN | 23.0 | 213.36 | 119.748288 | UCLA | USA | 1995 | 1 | ... | 2.8 | 1.7 | 0.3 | -14.1 | 0.102 | 0.169 | 0.195 | 0.500 | 0.064 | 1996-97 |
5 rows × 22 columns
# Filter the dataset down to the five most recent seasons so that there are no errors with Altair
recent_seasons = ['2018-19', '2019-20', '2020-21', '2021-22', '2022-23']
Filtered_NBA_DS = NBA_DS[NBA_DS['season'].isin(recent_seasons)].copy()
Filtered_NBA_DS.count()
Unnamed: 0           2743
player_name          2743
team_abbreviation    2743
age                  2743
player_height        2743
player_weight        2743
college              2337
country              2743
draft_year           2743
draft_round          2743
draft_number         2743
gp                   2743
pts                  2743
reb                  2743
ast                  2743
net_rating           2743
oreb_pct             2743
dreb_pct             2743
usg_pct              2743
ts_pct               2743
ast_pct              2743
season               2743
dtype: int64
#Creating the usage group to act as a filter
def usage(usg_pct):
    if usg_pct < .15:
        return 'Low Usage (0-15%)'
    elif usg_pct < .22:
        return 'Medium Usage (15-22%)'
    else:
        return 'High Usage (22%+)'
Filtered_NBA_DS['Usage_Group'] = Filtered_NBA_DS['usg_pct'].apply(usage)
#Creating the True Shooting group to act as a filter
def true_shooting_pct(ts_pct):
    if ts_pct < .53:
        return 'Inefficient True Shooting Percentage (0-53%)'
    elif ts_pct < .58:
        return 'Average True Shooting Percentage (53-58%)'
    elif ts_pct < .62:
        return 'Efficient True Shooting Percentage (58-62%)'
    else:
        return 'Elite True Shooting Percentage (62%+)'
Filtered_NBA_DS['True_Shooting_Percentage_Group'] = Filtered_NBA_DS['ts_pct'].apply(true_shooting_pct)
Filtered_NBA_DS['Overall_Statistical_Contribution'] = Filtered_NBA_DS[['pts', 'reb', 'ast']].sum(axis=1)
#Creating the Overall Statistical Contribution group to act as a filter
def PRA_Contribution(PRA):
    if PRA < 22:
        return 'Role Player (<22 PRA)'
    elif PRA < 28:
        return 'Average NBA Starter (22-28 PRA)'
    elif PRA < 35:
        return 'High-End NBA Starter (28-35 PRA)'
    elif PRA < 45:
        return 'All Star Level (35-45 PRA)'
    else:
        return 'Elite Superstar (45+ PRA)'
Filtered_NBA_DS['Overall_Contribution_Group'] = Filtered_NBA_DS['Overall_Statistical_Contribution'].apply(PRA_Contribution)
Filtered_NBA_DS.head()
| | Unnamed: 0 | player_name | team_abbreviation | age | player_height | player_weight | college | country | draft_year | draft_round | ... | oreb_pct | dreb_pct | usg_pct | ts_pct | ast_pct | season | Usage_Group | True_Shooting_Percentage_Group | Overall_Statistical_Contribution | Overall_Contribution_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10101 | 10101 | Zhou Qi | HOU | 23.0 | 215.90 | 95.25432 | NaN | China | 2016 | 2 | ... | 0.000 | 0.000 | 0.333 | 1.000 | 0.000 | 2018-19 | High Usage (22%+) | Elite True Shooting Percentage (62%+) | 2.0 | Role Player (<22 PRA) |
| 10102 | 10102 | Frank Mason | SAC | 25.0 | 180.34 | 86.18248 | Kansas | USA | 2017 | 2 | ... | 0.013 | 0.083 | 0.213 | 0.502 | 0.280 | 2018-19 | Medium Usage (15-22%) | Inefficient True Shooting Percentage (0-53%) | 8.4 | Role Player (<22 PRA) |
| 10103 | 10103 | Frank Ntilikina | NYK | 20.0 | 198.12 | 90.71840 | NaN | France | 2017 | 1 | ... | 0.012 | 0.080 | 0.163 | 0.417 | 0.195 | 2018-19 | Medium Usage (15-22%) | Inefficient True Shooting Percentage (0-53%) | 10.5 | Role Player (<22 PRA) |
| 10104 | 10104 | Fred VanVleet | TOR | 25.0 | 182.88 | 88.45044 | Wichita State | USA | Undrafted | Undrafted | ... | 0.012 | 0.078 | 0.177 | 0.539 | 0.243 | 2018-19 | Medium Usage (15-22%) | Average True Shooting Percentage (53-58%) | 18.4 | Role Player (<22 PRA) |
| 10105 | 10105 | Furkan Korkmaz | PHI | 21.0 | 200.66 | 86.18248 | NaN | Turkey | 2016 | 1 | ... | 0.023 | 0.130 | 0.174 | 0.528 | 0.107 | 2018-19 | Medium Usage (15-22%) | Inefficient True Shooting Percentage (0-53%) | 9.1 | Role Player (<22 PRA) |
5 rows × 26 columns
# Create a histogram of the 'pts' column, filtered by a usage-group dropdown
dropdown = alt.binding_select(options=Filtered_NBA_DS['Usage_Group'].unique().tolist(), name="Select a Usage Group for the Histogram: ")
# selection_point() and add_params() replace the alt.selection() and add_selection() calls deprecated in Altair 5
selection = alt.selection_point(fields=['Usage_Group'], bind=dropdown)
base = alt.Chart(Filtered_NBA_DS)
hist = base.mark_bar().encode(
    x = alt.X("pts", bin=alt.Bin(maxbins=8)),
    y = "count()"
).add_params(selection).transform_filter(selection).properties(title="Distribution of Average Points per Game by Usage Group")
hist.show()
# First-draft scatter plot: color by team, size by points, highlight by draft round
dropdown = alt.binding_select(options=Filtered_NBA_DS['draft_round'].unique().tolist(), name="Select a Player's draft round for the Scatter Plot: ")
selection = alt.selection_point(fields=['draft_round'], bind=dropdown)
Scatter_plot_firt_try = alt.Chart(Filtered_NBA_DS).mark_circle().encode(
    x = "usg_pct",
    y = "ts_pct",
    color=alt.Color('team_abbreviation', scale=alt.Scale(scheme='tableau20')),
    tooltip=['team_abbreviation', "player_name", "age", "player_height", "ts_pct", "usg_pct", "pts", "ast", "reb"],
    size = "pts",
    opacity=alt.condition(selection, alt.value(1), alt.value(.2))
).add_params(selection).properties(title="NBA Player's Usage Percentage vs. True Shooting Percentage")
Scatter_plot_firt_try.show()
First Visualization I Gave my Family for Evaluation.
hist | Scatter_plot_firt_try
Evaluation Approach:
As stated before, the target questions the visualization aims to answer are whether users can identify the NBA's elite players and what exactly makes them elite, and whether they can see patterns between a player's usage percentage and overall contribution statistics. Since the visualization aims to give users insight into which NBA players are elite, a qualitative evaluation was the best fit. Because I had already created a low-fi version of the visualizations before recruiting my husband, mother-in-law, and father-in-law for testing, I opted for a think-aloud study. My husband and father-in-law are huge basketball fans, so they have the domain knowledge to assess whether my visualization seemed accurate and whether it actually answered the questions under investigation. My mother-in-law, in contrast, is a casual fan who doesn't follow player stats. Her perspective was still valuable in the evaluation phase: if she could look at the visualizations and tell who is an elite player and who is not, that would tell me whether NBA fans in general could use this tool, not just people who are obsessed with player statistics.
From there, I asked my testers to play with my low-fi visualization for a couple of minutes and tell me what they gained from the experience. My first design is in this Jupyter notebook, labeled "First Visualization I Gave my Family for Evaluation." It includes a histogram that shows the distribution of players' average points per game and can be filtered by usage group (high, medium, or low). Filtering by usage group could show that high-usage players score more points. There was also a scatter plot that highlighted players by draft round, used a color scheme to indicate which team a player played for, and sized each point by the player's average points per game.
My mother-in-law didn't understand what a true shooting percentage was and had no basis for what would be considered an elite percentage, so she thought the plot didn't help her decide who would be an elite player. Finally, my husband thought that while usage percentage, true shooting percentage, and points were good places to look for elite players, they weren't enough for him to reach that conclusion. To him, I was ignoring other aspects of the game that are important for determining who is a truly elite player in the NBA. Therefore, he recommended that I also consider overall statistical contribution, because otherwise we could be excluding players from elite status even though they have very high rebound and assist totals.
Key Elements of the Final Design:
Based on feedback from the initial prototype, I changed the following aspects of my design to make it more user-friendly and better aligned with the overall goal of identifying elite players, identifying factors that make players elite, and assessing whether elite players tend to have a higher usage percentage. The list below explains what was edited, and why it helps meet the intended goals:
Left the histogram the same, as the feedback I received was positive, and my testers were getting the insights that I wanted them to obtain from the histogram.
Added a season filter. This allows the user to filter by season and see whether a player was playing at an elite level during that time period. That way, we avoid duplicate players in the scatter plot.
Removed the size parameter for the average points scored per game. This improved readability by removing that cluttered look.
Removed the categorical color scheme for the NBA teams and replaced it with a diverging color scheme based on each player's average points per game. This shows which players score the most points per game more intuitively than the size attribute did.
Added a column called True_Shooting_Percentage_Group. This groups the true shooting percentages into inefficient, average, efficient, and elite categories. The column is then used to highlight the players in the scatter plot who fall into each group based on their true shooting percentage. That way, we can point out elite scorers more easily for people who are not well-versed in basketball statistics.
Added another scatter plot that looks at usage percentage by overall statistical contributions, so we can see elite players because they also put up high numbers of rebounds and assists.
Added another column called Overall_Contribution_Group, which shows which players are Elite, All-Stars, High-End Starters, Average NBA Starters, and Role Players based on their average per-season points, rebounds, and assists (PRA). This will be used in the color scheme for that plot, so people can easily identify elite players and whether they have a high usage percentage.
Added a season filter to the new scatter plot, "NBA Players' Usage Percentage vs. Overall Statistical Contribution," to avoid the problem we had on the first scatter plot, where the same player appeared multiple times.
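The duplicate-player issue behind the season filters can be illustrated with a small check. The sketch below assumes the data has one row per player per season, which is how the dataset appears to be organized but is not verified here; the player names and values are made up.

```python
import pandas as pd

# Made-up rows: two players appear in both seasons, one in a single season.
toy = pd.DataFrame(
    {
        "player_name": ["Player A", "Player A", "Player B", "Player B", "Player C"],
        "season": ["2021-22", "2022-23", "2021-22", "2022-23", "2022-23"],
        "usg_pct": [0.25, 0.27, 0.16, 0.18, 0.12],
    }
)

# Without a season filter, some players contribute more than one point to a plot.
rows_per_player = toy.groupby("player_name").size()
repeated_players = rows_per_player[rows_per_player > 1]
print(repeated_players)

# Restricting to a single season leaves at most one row per player.
one_season = toy[toy["season"] == "2022-23"]
print(one_season["player_name"].is_unique)
```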
Together, these design adjustments make the visualization more interpretable, reduce confusion caused by duplicate data points, and highlight elite players more clearly through intuitive color groupings. I performed another think-aloud with the same testers after the changes were made, and they understood the charts and reached the intended insights more easily. As a result, the design changes produced a more effective way to convey the data and show which players' statistics align with those of elite players. The final design I presented to my testers is called "Final Visualizations After Feedback," which serves as a dashboard for all the visualizations created (the histogram and two scatter plots).
# Efficiency scatter plot: color by points, highlight by true shooting group
dropdown = alt.binding_select(options=Filtered_NBA_DS['True_Shooting_Percentage_Group'].unique().tolist(), name="Select a Player's True Shooting Percentage for the Efficiency Scatter Plot: ")
selection = alt.selection_point(fields=['True_Shooting_Percentage_Group'], bind=dropdown)
Scatter_plot = alt.Chart(Filtered_NBA_DS).mark_circle().encode(
    x = "usg_pct",
    y = "ts_pct",
    color=alt.Color('pts', scale=alt.Scale(scheme='redyellowblue')),
    tooltip=['team_abbreviation', "player_name", "age", "player_height", "ts_pct", "usg_pct", "pts", "ast", "reb"],
    opacity=alt.condition(selection, alt.value(1), alt.value(.2))
).add_params(selection).properties(title="NBA Players' Usage Percentage vs. True Shooting Percentage")
Scatter_plot.show()
# Add a season dropdown that filters the efficiency scatter plot
dropdown = alt.binding_select(options=Filtered_NBA_DS['season'].unique().tolist(), name="Season Filter for the Scoring Efficiency Scatter Plot: ")
selection = alt.selection_point(fields=['season'], bind=dropdown)
scatter = Scatter_plot.add_params(
    selection
).transform_filter(
    selection
)
scatter.show()
# Contribution scatter plot: color by PRA tier, filtered by a season dropdown
dropdown = alt.binding_select(options=Filtered_NBA_DS['season'].unique().tolist(), name="Season Filter for the Overall Contribution Scatter Plot: ")
selection = alt.selection_point(fields=['season'], bind=dropdown)
Scatter_plot_contribution = alt.Chart(Filtered_NBA_DS).mark_circle().encode(
    x = "usg_pct",
    y = "Overall_Statistical_Contribution",
    color=alt.Color('Overall_Contribution_Group', scale=alt.Scale(scheme="category10")),
    tooltip=['team_abbreviation', "player_name", "age", "player_height", "ts_pct", "usg_pct", "pts", "ast", "reb", "Overall_Statistical_Contribution"]
).add_params(selection).transform_filter(selection).properties(title="NBA Players' Usage Percentage vs. Overall Statistical Contribution")
Scatter_plot_contribution.show()
Final Visualizations After Feedback:
hist | scatter | Scatter_plot_contribution
Synthesis on Findings:
From the histogram, we confirmed that players with a high usage percentage also tend to score the most points on average. All of the visualizations confirmed that elite players in the NBA tend to have high usage percentages, and therefore, offenses tend to run through them. The scatter plot "NBA Players' Usage Percentage vs. True Shooting Percentage" also shows that players with high true shooting and usage percentages tend to be the league's elite scorers.
However, the shooting efficiency plot alone is not enough to surface every elite player. The scatter plot titled "NBA Players' Usage Percentage vs. Overall Statistical Contribution" classifies Nikola Jokic as an elite player. The shooting efficiency plot shows that Jokic has a high average points per game and a high true shooting percentage, so he would be classified as a very good, if not elite, player, but on that graph he is overshadowed by players who score more points. If we used only the shooting efficiency graph, many players would appear much better than him. In reality, many consider him the best player in the NBA because he has higher rebounding and assist totals, which is why the usage percentage by overall statistical contribution plot was needed to add nuance to the shooting efficiency graph. Without that nuance, we may overlook players who are well-rounded overall. Typically, players who score a lot of points, have a high usage percentage, and have a high true shooting percentage will also fall into the elite category on the overall statistical contribution plot. Still, there are edge cases like Jokic, whose elite status is harder to see from scoring alone.
What worked well in my approach was taking in the feedback I received and moving toward a tool that anybody can pick up and gain insights from. I also used color to my advantage after the feedback, and now people can intuitively tell from my scatter plot which players tend to score the most points. According to my testers, it is a clear improvement over the chart that used size to indicate which players scored the most points. In future iterations, I would try to link the season filters across the charts, because it's confusing to have the same filter on two charts that can be set independently. You could be in the 2018-2019 season on one chart and the 2022-2023 season on another. That is a data mix-up waiting to happen, but given the time constraints, I wasn't able to look into solving the issue. Secondly, as I gain more data skills, I want to create a chart that combines scoring, efficiency, and overall contribution statistics, so we only need one view to determine elite player status, since multiple views can become overwhelming to users. Overall, though, the current visualizations can help identify who is an elite player in the NBA and provide meaningful insights into players' usage percentage, true shooting percentage, and overall statistical contributions to their team.